SmartCrawler - Examples

Examples

This guide shows, through a few examples, how to crawl the contents of a site quickly and selectively. All the examples live in the examples directory and can be run as follows:

run-example.bat "exampleDirName"

where "exampleDirName" is the name of the directory containing the example configuration files.

Google Images

The "Google Images" example can be launched as follows:

run-example.bat googleImages

It fetches from http://images.google.com only the images returned by a search for the word "crawler". You can easily modify the example to search for a different keyword.

The custom configuration file used by this example is named google_images-config.xml and it can be found in $SMARTCRAWLER_HOME/examples/googleImages/conf:

<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>

<engine>
    <threadsNumber>5</threadsNumber>
</engine>

<loggers>
    <logger type="TRACER" active="yes"/>
    <logger type="ACCESS" active="yes"/>
    <logger type="LINK" active="yes"/>
    <logger type="PERMISSIONS" active="yes"/>
    <logger type="EXTRACTOR" active="yes"/>
    <logger type="CONSOLE" active="yes"/>
    <logger type="PERSISTER" active="yes"/>
    <logger type="PROVIDER" active="yes"/>
</loggers>

<retriever>
    <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>    
    
    <filters>
        <filter>
            <name>LinkFilter</name>
            <class>org.smartcrawler.filter.LinkFilter</class>
            <priority>5</priority>
            <filter-param>
                <param-name>links</param-name>
                <param-value>
                    *images.google.it/images* *.gif *.GIF *.jpg *.JPG
                </param-value>
            </filter-param>
        </filter>
    </filters>
</retriever>
    
<persister>
    <class>org.smartcrawler.persistence.FileSystemPersister</class>
    <persister-params>
        <persister-param>
            <param-name>preservePath</param-name>
            <param-value>true</param-value>
        </persister-param>
        <persister-param>
            <param-name>rootDir</param-name>
            <param-value>google</param-value>
        </persister-param>
    </persister-params>
    <filters>
        <filter>
            <name>ImagesLinkFilter</name>
            <class>org.smartcrawler.filter.ContentTypeLinkFilter</class>
            <priority>1</priority>
            <filter-param>
                <param-name>mime-type</param-name>
                <param-value>image</param-value>
            </filter-param>
        </filter>
    </filters>
</persister>
    

</smartcrawler>
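
As a minimal sketch of how this configuration might be adapted, the fragment below extends the LinkFilter so that links to PNG files are followed as well. It reuses only elements that appear in the listing above. The added *.png and *.PNG patterns are an assumption: they presume that the links parameter accepts further whitespace-separated wildcard patterns in the same way as the existing ones.

```xml
<!-- Hypothetical variation of the LinkFilter in google_images-config.xml -->
<filter>
    <name>LinkFilter</name>
    <class>org.smartcrawler.filter.LinkFilter</class>
    <priority>5</priority>
    <filter-param>
        <param-name>links</param-name>
        <param-value>
            <!-- *.png and *.PNG are added for illustration only -->
            *images.google.it/images* *.gif *.GIF *.jpg *.JPG *.png *.PNG
        </param-value>
    </filter-param>
</filter>
```

The persister's ContentTypeLinkFilter is configured with the mime-type value image, so it would presumably store PNG responses without further changes. This guide does not confirm that behaviour.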

New York Times RSS

The "New York Times RSS" example can be launched as follows:

run-example.bat nytRss

It crawls the web site http://www.nyt.com and retrieves all the RSS feeds it finds there.

The custom configuration file used by this example is named nyt_rss-config.xml and it can be found in $SMARTCRAWLER_HOME/examples/nytRss/conf:

<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>

<engine>
    <threadsNumber>5</threadsNumber>
</engine>

<loggers>
    <logger type="TRACER"       active="no"/>
    <logger type="CONSOLE"      active="yes"/>
    <logger type="ACCESS"       active="no"/>
    <logger type="LINK"         active="no"/>
    <logger type="EXTRACTOR"    active="no"/>
    <logger type="PROVIDER"     active="no"/>
    <logger type="PERMISSIONS"  active="no"/>
    <logger type="PERSISTER"    active="no"/>
</loggers>

<retriever>
    <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>    
    
    <filters>
        <filter>
            <name>DefaultLinkFilter</name>
            <class>org.smartcrawler.filter.DefaultLinkFilter</class>
            <priority>1</priority>
        </filter>
        <filter>
            <name>LinkFilter</name>
            <class>org.smartcrawler.filter.LinkFilter</class>
            <priority>2</priority>
            <filter-param>
                <param-name>links</param-name>
                <param-value>
                    */rss*
                </param-value>
            </filter-param>
        </filter>
    </filters>
</retriever>
    
<persister>
    <class>org.smartcrawler.persistence.FileSystemPersister</class>
    <persister-params>
        <persister-param>
            <param-name>preservePath</param-name>
            <param-value>true</param-value>
        </persister-param>
        <persister-param>
            <param-name>rootDir</param-name>
            <param-value>nyt</param-value>
        </persister-param>
    </persister-params>
    <filters>
        <filter>
            <name>XMLLinkFilter</name>
            <class>org.smartcrawler.filter.ContentTypeLinkFilter</class>
            <priority>1</priority>
            <filter-param>
                <param-name>mime-type</param-name>
                <param-value>xml</param-value>
            </filter-param>
        </filter>
    </filters>
</persister>
    

</smartcrawler>
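
The configuration above switches off every logger except CONSOLE, whereas the Google Images configuration enables them all. If a crawl stores fewer feeds than expected, one option might be to re-enable some loggers. The fragment below is a sketch under the assumption that the LINK and PERSISTER loggers report on the links followed and the files stored. Their names suggest this, but this guide does not document it.

```xml
<!-- Hypothetical variation of the <loggers> section in nyt_rss-config.xml -->
<loggers>
    <logger type="TRACER"       active="no"/>
    <logger type="CONSOLE"      active="yes"/>
    <logger type="ACCESS"       active="no"/>
    <!-- Assumed to report which links pass the retriever filters -->
    <logger type="LINK"         active="yes"/>
    <logger type="EXTRACTOR"    active="no"/>
    <logger type="PROVIDER"     active="no"/>
    <logger type="PERMISSIONS"  active="no"/>
    <!-- Assumed to report which resources are written under rootDir -->
    <logger type="PERSISTER"    active="yes"/>
</loggers>
```

Only the active attribute changes. The logger types are exactly those already listed in both example configurations.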
